1.
JAMA Ophthalmol ; 142(3): 226-233, 2024 Mar 01.
Article in English | MEDLINE | ID: mdl-38329740

ABSTRACT

Importance: Deep learning image analysis often depends on large, labeled datasets, which are difficult to obtain for rare diseases. Objective: To develop a self-supervised approach for automated classification of macular telangiectasia type 2 (MacTel) on optical coherence tomography (OCT) with limited labeled data. Design, Setting, and Participants: This was a retrospective comparative study. OCT images from May 2014 to May 2019 were collected by the Lowy Medical Research Institute, La Jolla, California, and the University of Washington, Seattle, from January 2016 to October 2022. Clinical diagnoses of patients with and without MacTel were confirmed by retina specialists. Data were analyzed from January to September 2023. Exposures: Two convolutional neural networks were pretrained using the Bootstrap Your Own Latent algorithm on unlabeled training data and fine-tuned with labeled training data to predict MacTel (self-supervised method). ResNet18 and ResNet50 models were also trained using all labeled data (supervised method). Main Outcomes and Measures: The ground truth yes vs no MacTel diagnosis was determined by retinal specialists based on spectral-domain OCT. The models' predictions were compared against human graders using accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), area under precision recall curve (AUPRC), and area under the receiver operating characteristic curve (AUROC). Uniform manifold approximation and projection was performed for dimension reduction and GradCAM visualizations for supervised and self-supervised methods. Results: A total of 2636 OCT scans from 780 patients with MacTel and 131 patients without MacTel were included from the MacTel Project (mean [SD] age, 60.8 [11.7] years; 63.8% female), and another 2564 from 1769 patients without MacTel from the University of Washington (mean [SD] age, 61.2 [18.1] years; 53.4% female).
The self-supervised approach fine-tuned on 100% of the labeled training data with ResNet50 as the feature extractor performed the best, achieving an AUPRC of 0.971 (95% CI, 0.969-0.972), an AUROC of 0.970 (95% CI, 0.970-0.973), accuracy of 0.898, sensitivity of 0.898, specificity of 0.949, PPV of 0.935, and NPV of 0.919. With only 419 OCT volumes (185 MacTel patients in 10% of labeled training dataset), the ResNet18 self-supervised model achieved comparable performance, with an AUPRC of 0.958 (95% CI, 0.957-0.960), an AUROC of 0.966 (95% CI, 0.964-0.967), and accuracy, sensitivity, specificity, PPV, and NPV of 0.902, 0.884, 0.916, 0.896, and 0.906, respectively. The self-supervised models showed better agreement with the more experienced human expert graders. Conclusions and Relevance: The findings suggest that self-supervised learning may improve the accuracy of automated MacTel vs non-MacTel binary classification on OCT with limited labeled training data, and these approaches may be applicable to other rare diseases, although further research is warranted.
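
The threshold-based measures named in this abstract (accuracy, sensitivity, specificity, PPV, NPV) all derive from one binary confusion matrix. The following is a minimal sketch of those definitions; the function name `binary_metrics` and the counts in the example are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return standard binary-classification measures as fractions in [0, 1]."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,       # all correct calls over all cases
        "sensitivity": tp / (tp + fn),       # true-positive rate (recall)
        "specificity": tn / (tn + fp),       # true-negative rate
        "ppv": tp / (tp + fp),               # positive predictive value (precision)
        "npv": tn / (tn + fn),               # negative predictive value
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting every measure on the same 0-to-1 scale, as this sketch does, avoids mixing fractions and percentages in one results sentence.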


Subjects
Deep Learning; Retinal Telangiectasis; Humans; Female; Middle Aged; Male; Tomography, Optical Coherence/methods; Retrospective Studies; Rare Diseases; Retinal Telangiectasis/diagnostic imaging; Supervised Machine Learning
2.
PLoS One ; 19(2): e0297271, 2024.
Article in English | MEDLINE | ID: mdl-38315667

ABSTRACT

Differentially private (DP) synthetic datasets are a solution for sharing data while preserving the privacy of individual data providers. Understanding the effects of utilizing DP synthetic data in end-to-end machine learning pipelines impacts areas such as health care and humanitarian action, where data is scarce and regulated by restrictive privacy laws. In this work, we investigate the extent to which synthetic data can replace real, tabular data in machine learning pipelines and identify the most effective synthetic data generation techniques for training and evaluating machine learning models. We systematically investigate the impacts of differentially private synthetic data on downstream classification tasks from the point of view of utility as well as fairness. Our analysis is comprehensive and includes representatives of the two main types of synthetic data generation algorithms: marginal-based and GAN-based. To the best of our knowledge, our work is the first that: (i) proposes a training and evaluation framework that does not assume that real data is available for testing the utility and fairness of machine learning models trained on synthetic data; (ii) presents the most extensive analysis of synthetic dataset generation algorithms in terms of utility and fairness when used for training machine learning models; and (iii) encompasses several different definitions of fairness. Our findings demonstrate that marginal-based synthetic data generators surpass GAN-based ones regarding model training utility for tabular data. Indeed, we show that models trained using data generated by marginal-based algorithms can exhibit similar utility to models trained using real data. Our analysis also reveals that the marginal-based synthetic data generated using AIM and MWEM PGM algorithms can train models that simultaneously achieve utility and fairness characteristics close to those obtained by models trained with real data.


Subjects
Algorithms; Health Facilities; Interior Design and Furnishings; Knowledge; Machine Learning
3.
Nature ; 626(8000): 881-890, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38297124

ABSTRACT

The pace of human brain development is highly protracted compared with most other species [1-7]. The maturation of cortical neurons is particularly slow, taking months to years to develop adult functions [3-5]. Remarkably, such protracted timing is retained in cortical neurons derived from human pluripotent stem cells (hPSCs) during in vitro differentiation or upon transplantation into the mouse brain [4,8,9]. Those findings suggest the presence of a cell-intrinsic clock setting the pace of neuronal maturation, although the molecular nature of this clock remains unknown. Here we identify an epigenetic developmental programme that sets the timing of human neuronal maturation. First, we developed an hPSC-based approach to synchronize the birth of cortical neurons in vitro, which enabled us to define an atlas of morphological, functional and molecular maturation. We observed a slow unfolding of maturation programmes, limited by the retention of specific epigenetic factors. Loss of function of several of those factors in cortical neurons enables precocious maturation. Transient inhibition of EZH2, EHMT1 and EHMT2 or DOT1L at the progenitor stage primes newly born neurons to rapidly acquire mature properties upon differentiation. Thus our findings reveal that the rate at which human neurons mature is set well before neurogenesis through the establishment of an epigenetic barrier in progenitor cells. Mechanistically, this barrier holds transcriptional maturation programmes in a poised state that is gradually released to ensure the prolonged timeline of human cortical neuron maturation.


Subjects
Epigenesis, Genetic; Gene Expression Regulation, Developmental; Human Embryonic Stem Cells; Neural Stem Cells; Neurogenesis; Neurons; Adult; Animals; Humans; Mice; Histocompatibility Antigens/metabolism; Histone-Lysine N-Methyltransferase/antagonists & inhibitors; Histone-Lysine N-Methyltransferase/metabolism; Human Embryonic Stem Cells/cytology; Human Embryonic Stem Cells/metabolism; Neural Stem Cells/cytology; Neural Stem Cells/metabolism; Neurogenesis/genetics; Neurons/cytology; Neurons/metabolism; Time Factors; Transcription, Genetic
4.
bioRxiv ; 2023 Nov 10.
Article in English | MEDLINE | ID: mdl-37986761

ABSTRACT

Proteomics has been revolutionized by large pre-trained protein language models, which learn unsupervised representations from large corpora of sequences. The parameters of these models are then fine-tuned in a supervised setting to tailor the model to a specific downstream task. However, as model size increases, the computational and memory footprint of fine-tuning becomes a barrier for many research groups. In the field of natural language processing, which has seen a similar explosion in the size of models, these challenges have been addressed by methods for parameter-efficient fine-tuning (PEFT). In this work, we bring parameter-efficient fine-tuning methods to proteomics for the first time. Using the parameter-efficient method LoRA, we train new models for two important proteomic tasks: predicting protein-protein interactions (PPI) and predicting the symmetry of homooligomers. We show that for homooligomer symmetry prediction, these approaches achieve performance competitive with traditional fine-tuning while requiring reduced memory and using three orders of magnitude fewer parameters. On the PPI prediction task, we surprisingly find that PEFT models actually outperform traditional fine-tuning while using two orders of magnitude fewer parameters. We go even further to show that freezing the parameters of the language model and training only a classification head also outperforms fine-tuning, using five orders of magnitude fewer parameters, and that both of these models outperform state-of-the-art PPI prediction methods with substantially reduced compute. We also demonstrate that PEFT is robust to variations in training hyper-parameters, and elucidate where best practices for PEFT in proteomics differ from those in natural language processing. Thus, we provide a blueprint to democratize the power of protein language model tuning to groups which have limited computational resources.
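
The parameter savings described in this abstract come from the general LoRA idea: the pre-trained weight matrix stays frozen and only a low-rank update is trained. The following is a minimal pure-Python sketch of that idea for a single linear layer. The dimensions, the `alpha` scaling constant, and the helper names (`matmul`, `lora_forward`) are illustrative assumptions, not the authors' implementation or any library's API.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, a, b, scale):
    """Compute x @ W + scale * (x @ A @ B); only A and B would be trained."""
    base = matmul(x, w_frozen)            # frozen pre-trained path
    update = matmul(matmul(x, a), b)      # low-rank adapter path
    return [[p + scale * q for p, q in zip(rb, ru)] for rb, ru in zip(base, update)]


# Hypothetical sizes; real protein language models use far larger layers.
d_in, d_out, rank, alpha = 8, 8, 2, 4.0
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]    # frozen
a = [[rng.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]  # trainable
b = [[0.0] * d_out for _ in range(rank)]  # trainable, zero init: adapter starts as a no-op
x = [[rng.gauss(0, 1) for _ in range(d_in)]]
scale = alpha / rank

full_params = d_in * d_out            # parameters touched by full fine-tuning
lora_params = rank * (d_in + d_out)   # parameters touched by the adapter
print(full_params, lora_params)
y = lora_forward(x, w, a, b, scale)
```

Because the adapter adds `rank * (d_in + d_out)` parameters against `d_in * d_out` for the full matrix, the saving grows with layer width, which is consistent with the orders-of-magnitude reductions the abstract reports for large models.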

5.
SN Comput Sci ; 4(4): 402, 2023.
Article in English | MEDLINE | ID: mdl-37214587

ABSTRACT

Grammar is a key input in grammar-based genetic programming. Grammar design influences not only performance but also program size. However, grammar design and the choice of productions often require expert input, as no automatic approach exists. This work discusses our approach to automatically reducing a bloated grammar. By utilizing a simple Production Ranking mechanism, we identify productions which are less useful and dynamically prune those to channel evolutionary search towards better (smaller) solutions. Our objective in this work was program size reduction without compromising generalization performance. We tested our approach on 13 standard symbolic regression datasets with Grammatical Evolution. Using a grammar embodying a well-defined function set as a baseline, we compare effective genome length and test performance with our approach. Dynamic grammar pruning achieved significantly better genome lengths for all datasets, while significantly improving generalization performance on three datasets, although it worsened in five datasets. When we utilized linear scaling during the production ranking stages (the first 20 generations), the results dramatically improved. Not only were the programs smaller in all datasets, but generalization scores were also significantly better than the baseline in 6 out of 13 datasets, and comparable in the rest. When the baseline was also linearly scaled, the program size was still smaller with the Production Ranking approach, while generalization scores dropped in only three datasets without any significant compromise in the rest.

6.
Nat Commun ; 14(1): 1177, 2023 03 01.
Article in English | MEDLINE | ID: mdl-36859488

ABSTRACT

Cryptic pockets expand the scope of drug discovery by enabling targeting of proteins currently considered undruggable because they lack pockets in their ground state structures. However, identifying cryptic pockets is labor-intensive and slow. The ability to accurately and rapidly predict if and where cryptic pockets are likely to form from a structure would greatly accelerate the search for druggable pockets. Here, we present PocketMiner, a graph neural network trained to predict where pockets are likely to open in molecular dynamics simulations. Applying PocketMiner to single structures from a newly curated dataset of 39 experimentally confirmed cryptic pockets demonstrates that it accurately identifies cryptic pockets (ROC-AUC: 0.87) >1,000-fold faster than existing methods. We apply PocketMiner across the human proteome and show that predicted pockets open in simulations, suggesting that over half of proteins thought to lack pockets based on available structures likely contain cryptic pockets, vastly expanding the potentially druggable proteome.


Subjects
Labor, Obstetric; Proteome; Humans; Pregnancy; Female; Drug Discovery; Molecular Dynamics Simulation; Neural Networks, Computer
7.
BMC Public Health ; 22(1): 2394, 2022 12 20.
Article in English | MEDLINE | ID: mdl-36539760

ABSTRACT

BACKGROUND: Despite an abundance of information on the risk factors of SARS-CoV-2, there have been few US-wide studies of long-term effects. In this paper we analyzed a large medical claims database of US based individuals to identify common long-term effects as well as their associations with various social and medical risk factors. METHODS: The medical claims database was obtained from a prominent US based claims data processing company, namely Change Healthcare. In addition to the claims data, the dataset also consisted of various social determinants of health such as race, income, education level and veteran status of the individuals. A self-controlled cohort design (SCCD) observational study was performed to identify ICD-10 codes whose proportion was significantly increased in the outcome period compared to the control period to identify significant long-term effects. A logistic regression-based association analysis was then performed between identified long-term effects and social determinants of health. RESULTS: Among the over 1.37 million COVID patients in our datasets we found 36 out of 1724 3-digit ICD-10 codes to be statistically significantly increased in the post-COVID period (p-value < 0.05). We also found one combination of ICD-10 codes, corresponding to 'other anemias' and 'hypertension', that was statistically significantly increased in the post-COVID period (p-value < 0.05). Our logistic regression-based association analysis with social determinants of health variables, after adjusting for comorbidities and prior conditions, showed that age and gender were significantly associated with the multiple long-term effects. Race was only associated with 'other sepsis', income was only associated with 'Alopecia areata' (autoimmune disease causing hair loss), while education level was only associated with 'Maternal infectious and parasitic diseases' (p-value < 0.05). 
CONCLUSION: We identified several long-term effects of SARS-CoV-2 through a self-controlled study on a cohort of over one million patients. Furthermore, we found that while age and gender are commonly associated with the long-term effects, other social determinants of health such as race, income and education levels have rare or no significant associations.


Subjects
COVID-19; Humans; COVID-19/epidemiology; SARS-CoV-2; Social Determinants of Health; Risk Factors; Comorbidity
8.
JMIR Public Health Surveill ; 8(11): e38898, 2022 11 08.
Article in English | MEDLINE | ID: mdl-36265135

ABSTRACT

BACKGROUND: Several risk factors have been identified for severe COVID-19 disease by the scientific community. In this paper, we focus on understanding the risks for severe COVID-19 infections after vaccination (ie, in breakthrough SARS-CoV-2 infections). Studying these risks by vaccine type, age, sex, comorbidities, and any prior SARS-CoV-2 infection is important to policy makers planning further vaccination efforts. OBJECTIVE: We performed a comparative study of the risks of hospitalization (n=1140) and mortality (n=159) in a SARS-CoV-2 positive cohort of 19,815 patients who were all fully vaccinated with the Pfizer, Moderna, or Janssen vaccines. METHODS: We performed Cox regression analysis to calculate the risk factors for developing a severe breakthrough SARS-CoV-2 infection in the study cohort by controlling for vaccine type, age, sex, comorbidities, and a prior SARS-CoV-2 infection. RESULTS: We found lower hazard ratios for those receiving the Moderna vaccine (P<.001) and Pfizer vaccine (P<.001), with the lowest hazard rates being for Moderna, as compared to those who received the Janssen vaccine, independent of age, sex, comorbidities, vaccine type, and prior SARS-CoV-2 infection. Further, individuals who had a SARS-CoV-2 infection prior to vaccination had some increased protection over and above the protection already provided by the vaccines, from hospitalization (P=.001) and death (P=.04), independent of age, sex, comorbidities, and vaccine type. We found that the top statistically significant risk factors for severe breakthrough SARS-CoV-2 infections were age of >50, male gender, moderate and severe renal failure, severe liver disease, leukemia, chronic lung disease, coagulopathy, and alcohol abuse. CONCLUSIONS: Among individuals who were fully vaccinated, the risk of severe breakthrough SARS-CoV-2 infection was lower for recipients of the Moderna or Pfizer vaccines and higher for recipients of the Janssen vaccine. 
These results from our analysis at a population level will be helpful to public health policy makers. Our result on the influence of a previous SARS-CoV-2 infection necessitates further research into the impact of multiple exposures on the risk of developing severe COVID-19.


Subjects
COVID-19; Viral Vaccines; Humans; Male; COVID-19/epidemiology; COVID-19/prevention & control; SARS-CoV-2; Vaccination; Hospitalization
9.
Genome Biol ; 23(1): 174, 2022 08 15.
Article in English | MEDLINE | ID: mdl-35971180

ABSTRACT

We present a novel unsupervised deep learning approach called BindVAE, based on Dirichlet variational autoencoders, for jointly decoding multiple TF binding signals from open chromatin regions. BindVAE can disentangle an input DNA sequence into distinct latent factors that encode cell-type specific in vivo binding signals for individual TFs, composite patterns for TFs involved in cooperative binding, and genomic context surrounding the binding sites. On the task of retrieving the motifs of expressed TFs in a given cell type, BindVAE is competitive with existing motif discovery approaches.


Subjects
Chromatin; Transcription Factors; Binding Sites/genetics; Chromatin Immunoprecipitation; Nucleotide Motifs; Protein Binding/genetics; Transcription Factors/metabolism
10.
Sci Rep ; 12(1): 8602, 2022 05 21.
Article in English | MEDLINE | ID: mdl-35597791

ABSTRACT

This work investigates the potential for using Grammatical Evolution (GE) to generate an initial seed for the construction of a pseudo-random number generator (PRNG) and cryptographically secure (CS) PRNG. We demonstrate the suitability of GE as an entropy source and show that the initial seeds exhibit an average entropy value of 7.940560934 for 8-bit entropy, which is close to the ideal value of 8. We then construct two random number generators, GE-PRNG and GE-CSPRNG, both of which employ these initial seeds. We use Monte Carlo simulations to establish the efficacy of the GE-PRNG using an experimental setup designed to estimate the value for pi, in which 100,000,000 random numbers were generated by our system. This returned the value of pi of 3.146564000, which is precise up to six decimal digits for the actual value of pi. We propose a new approach called control_flow_incrementor to generate cryptographically secure random numbers. The random numbers generated with CSPRNG meet the prescribed National Institute of Standards and Technology SP800-22 and the Diehard statistical test requirements. We also present a computational performance analysis of GE-CSPRNG demonstrating its potential to be used in industrial applications.
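
Two of the checks mentioned in this abstract can be sketched briefly: the 8-bit Shannon entropy of a seed (maximum 8 bits per byte) and a Monte Carlo estimate of pi from uniform random points. The sketch below uses Python's built-in `random.Random` as a stand-in generator, since the GE-PRNG itself is not described here; the function names `byte_entropy` and `estimate_pi` are hypothetical.

```python
import math
import random
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; 8.0 means all 256 values are equally likely."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def estimate_pi(n_samples: int, rng: random.Random) -> float:
    """Estimate pi as 4x the fraction of random points in the unit square that fall inside the quarter circle."""
    inside = sum(
        1 for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / n_samples


# Stand-in generator (Mersenne Twister), assumed here only for illustration.
rng = random.Random(1)
seed_bytes = bytes(rng.randrange(256) for _ in range(4096))
print(f"entropy: {byte_entropy(seed_bytes):.4f} bits per byte")
print(f"pi estimate: {estimate_pi(100_000, rng):.4f}")
```

The standard error of this estimator shrinks in proportion to 1/sqrt(n), so each additional correct decimal digit requires roughly 100 times more samples.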


Subjects
Monte Carlo Method
11.
Gigascience ; 10(12), 2021 12 29.
Article in English | MEDLINE | ID: mdl-34966926

ABSTRACT

BACKGROUND: Network propagation has been widely used for nearly 20 years to predict gene functions and phenotypes. Despite the popularity of this approach, little attention has been paid to the question of provenance tracing in this context, e.g., determining how much any experimental observation in the input contributes to the score of every prediction. RESULTS: We design a network propagation framework with 2 novel components and apply it to predict human proteins that directly or indirectly interact with SARS-CoV-2 proteins. First, we trace the provenance of each prediction to its experimentally validated sources, which in our case are human proteins experimentally determined to interact with viral proteins. Second, we design a technique that helps to reduce the manual adjustment of parameters by users. We find that for every top-ranking prediction, the highest contribution to its score arises from a direct neighbor in a human protein-protein interaction network. We further analyze these results to develop functional insights on SARS-CoV-2 that expand on known biology such as the connection between endoplasmic reticulum stress, HSPA5, and anti-clotting agents. CONCLUSIONS: We examine how our provenance-tracing method can be generalized to a broad class of network-based algorithms. We provide a useful resource for the SARS-CoV-2 community that implicates many previously undocumented proteins with putative functional relationships to viral infection. This resource includes potential drugs that can be opportunistically repositioned to target these proteins. We also discuss how our overall framework can be extended to other, newly emerging viruses.
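
The provenance question posed in this abstract has a simple form for linear propagation schemes: because the final score vector is a linear function of the seed vector, propagating each experimentally validated source on its own yields that source's exact contribution to every prediction. The sketch below uses random walk with restart on a toy graph. It is an assumption-laden illustration, not the authors' framework; the graph, the `alpha` value, and the function names are hypothetical.

```python
def propagate(adj, seed, alpha=0.5, iters=200):
    """Random walk with restart: score = alpha * (walk step) + (1 - alpha) * seed."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    score = list(seed)
    for _ in range(iters):
        new = []
        for i in range(n):
            # Each neighbour j passes on its score divided by its degree.
            incoming = sum(
                adj[j][i] * score[j] / deg[j] for j in range(n) if deg[j] > 0
            )
            new.append(alpha * incoming + (1 - alpha) * seed[i])
        score = new
    return score


def provenance(adj, sources, alpha=0.5):
    """Propagate each source separately; contrib[s][i] is the share of node i's score due to source s."""
    n = len(adj)
    contrib = {}
    for s in sources:
        one_hot = [1.0 if i == s else 0.0 for i in range(n)]
        contrib[s] = propagate(adj, one_hot, alpha)
    return contrib


# Toy path graph 0 - 1 - 2 - 3 with sources at both ends (hypothetical data).
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
sources = [0, 3]
total = propagate(adj, [1.0, 0.0, 0.0, 1.0])
contrib = provenance(adj, sources)
for node in range(4):
    shares = {s: round(contrib[s][node], 4) for s in sources}
    print(node, round(total[node], 4), shares)
```

In this toy graph, node 1 receives most of its score from its direct neighbour, source 0, which mirrors the abstract's finding that the highest contribution to a top-ranking prediction arises from a direct neighbour.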


Subjects
COVID-19; SARS-CoV-2; Algorithms; Humans; Protein Interaction Maps; Proteins/metabolism
12.
Sci Rep ; 11(1): 17085, 2021 08 24.
Article in English | MEDLINE | ID: mdl-34429468

ABSTRACT

We present a deep learning approach towards the large-scale prediction and analysis of bird acoustics from 100 different bird species. We use spectrograms constructed on bird audio recordings from the Cornell Bird Challenge (CBC)2020 dataset, which includes recordings of multiple and potentially overlapping bird vocalizations with background noise. Our experiments show that a hybrid modeling approach that involves a Convolutional Neural Network (CNN) for learning the representation for a slice of the spectrogram, and a Recurrent Neural Network (RNN) for the temporal component to combine across time-points leads to the most accurate model on this dataset. We show results on a spectrum of models ranging from stand-alone CNNs to hybrid models of various types obtained by combining CNNs with other CNNs or RNNs of the following types: Long Short-Term Memory (LSTM) networks, Gated Recurrent Units (GRU), and Legendre Memory Units (LMU). The best performing model achieves an average accuracy of 67% over the 100 different bird species, with the highest accuracy of 90% for the bird species, Red crossbill. We further analyze the learned representations visually and find them to be intuitive, where we find that related bird species are clustered close together. We present a novel way to empirically interpret the representations learned by the LMU-based hybrid model which shows how memory channel patterns change over time with the changes seen in the spectrograms.


Subjects
Birds/classification; Deep Learning; Vocalization, Animal/classification; Animals; Birds/physiology
13.
Pac Symp Biocomput ; 26: 154-165, 2021.
Article in English | MEDLINE | ID: mdl-33691013

ABSTRACT

Viruses such as the novel coronavirus SARS-CoV-2, which is wreaking havoc on the world, depend on interactions of their own proteins with those of the human host cells. Relatively small changes in sequence, such as between SARS-CoV and SARS-CoV-2, can dramatically change clinical phenotypes of the virus, including transmission rates and severity of the disease. On the other hand, highly dissimilar virus families such as Coronaviridae, Ebola, and HIV have overlap in functions. In this work we aim to analyze the role of protein sequence in the binding of SARS-CoV-2 virus proteins towards human proteins and compare it to that of the other viruses above. We build supervised machine learning models, using Generalized Additive Models, to predict interactions based on sequence features and find that our models perform well, with an AUC-PR of 0.65 in a class-skew of 1:10. Analysis of the novel predictions using an independent dataset showed statistically significant enrichment. We further map the importance of specific amino-acid sequence features in predicting binding and summarize what combinations of sequences from the virus and the host are correlated with an interaction. By analyzing the sequence-based embeddings of the interactomes from different viruses and clustering them together, we find some functionally similar proteins from different viruses. For example, vif protein from HIV-1, vp24 from Ebola and orf3b from SARS-CoV all function as interferon antagonists. Furthermore, we can differentiate the functions of similar viruses; for example, orf3a's interactions are more diverged than orf7b interactions when comparing SARS-CoV and SARS-CoV-2.


Subjects
COVID-19; SARS-CoV-2; Amino Acid Sequence; Computational Biology; Humans; Proteins
14.
Nat Methods ; 16(9): 858-861, 2019 09.
Article in English | MEDLINE | ID: mdl-31406384

ABSTRACT

The decoding of transcription factor (TF) binding signals in genomic DNA is a fundamental problem. Here we present a prediction model called BindSpace that learns to embed DNA sequences and TF labels into the same space. By training on binding data from hundreds of TFs and embedding over 1 M DNA sequences, BindSpace achieves state-of-the-art multiclass binding prediction performance, in vitro and in vivo, and can distinguish between signals of closely related TFs.


Subjects
Algorithms; Computational Biology/methods; DNA/metabolism; Machine Learning; Transcription Factors/metabolism; Binding Sites; Chromatin Immunoprecipitation; DNA/chemistry; Humans; Protein Binding
15.
J Comput Biol ; 24(6): 501-514, 2017 Jun.
Article in English | MEDLINE | ID: mdl-28128642

ABSTRACT

Disease-causing pathogens such as viruses introduce their proteins into the host cells in which they interact with the host's proteins, enabling the virus to replicate inside the host. These interactions between pathogen and host proteins are key to understanding infectious diseases. Often multiple diseases involve phylogenetically related or biologically similar pathogens. Here we present a multitask learning method to jointly model interactions between human proteins and three different but related viruses: Hepatitis C, Ebola virus, and Influenza A. Our multitask matrix completion-based model uses a shared low-rank structure in addition to a task-specific sparse structure to incorporate the various interactions. We obtain between 7 and 39 percentage points improvement in predictive performance over prior state-of-the-art models. We show how our model's parameters can be interpreted to reveal both general and specific interaction-relevant characteristics of the viruses. Our code is available online.


Subjects
Algorithms; Computational Biology/methods; Host-Pathogen Interactions; Protein Interaction Maps; Proteins/metabolism; Databases, Protein; Ebolavirus/metabolism; Hepacivirus/metabolism; Humans; Influenza A virus/metabolism; Models, Molecular; Protein Conformation
16.
Front Microbiol ; 6: 45, 2015.
Article in English | MEDLINE | ID: mdl-25674082

ABSTRACT

Salmonellosis is the most frequent foodborne disease worldwide and can be transmitted to humans by a variety of routes, especially via animal and plant products. Salmonella bacteria are believed to use not only animal and human but also plant hosts despite their evolutionary distance. This raises the question if Salmonella employs similar mechanisms in infection of these diverse hosts. Given that most of our understanding comes from its interaction with human hosts, we investigate here to what degree knowledge of Salmonella-human interactions can be transferred to the Salmonella-plant system. Reviewed are recent publications on analysis and prediction of Salmonella-host interactomes. Putative protein-protein interactions (PPIs) between Salmonella and its human and Arabidopsis hosts were retrieved utilizing purely interolog-based approaches in which predictions were inferred based on available sequence and domain information of known PPIs, and machine learning approaches that integrate a larger set of useful information from different sources. Transfer learning is an especially suitable machine learning technique to predict plant host targets from the knowledge of human host targets. A comparison of the prediction results with transcriptomic data shows a clear overlap between the host proteins predicted to be targeted by PPIs and their gene ontology enrichment in both host species and regulation of gene expression. In particular, the cellular processes Salmonella interferes with in plants and humans are catabolic processes. The details of how these processes are targeted, however, are quite different between the two organisms, as expected based on their evolutionary and habitat differences. Possible implications of this observation on evolution of host-pathogen communication are discussed.

17.
Front Microbiol ; 6: 36, 2015.
Article in English | MEDLINE | ID: mdl-25699028

ABSTRACT

We consider the problem of building a model to predict protein-protein interactions (PPIs) between the bacterial species Salmonella Typhimurium and the plant host Arabidopsis thaliana which is a host-pathogen pair for which no known PPIs are available. To achieve this, we present approaches, which use homology and statistical learning methods called "transfer learning." In the transfer learning setting, the task of predicting PPIs between Arabidopsis and its pathogen S. Typhimurium is called the "target task." The presented approaches utilize labeled data i.e., known PPIs of other host-pathogen pairs (we call these PPIs the "source tasks"). The homology based approaches use heuristics based on biological intuition to predict PPIs. The transfer learning methods use the similarity of the PPIs from the source tasks to the target task to build a model. For a quantitative evaluation we consider Salmonella-mouse PPI prediction and some other host-pathogen tasks where known PPIs exist. We use metrics such as precision and recall and our results show that our methods perform well on the target task in various transfer settings. We present a brief qualitative analysis of the Arabidopsis-Salmonella predicted interactions. We filter the predictions from all approaches using Gene Ontology term enrichment and only those interactions involving Salmonella effectors. Thereby we observe that Arabidopsis proteins involved e.g., in transcriptional regulation, hormone mediated signaling and defense response may be affected by Salmonella.

18.
Bioinformatics ; 29(13): i217-26, 2013 Jul 01.
Article in English | MEDLINE | ID: mdl-23812987

ABSTRACT

MOTIVATION: An important aspect of infectious disease research involves understanding the differences and commonalities in the infection mechanisms underlying various diseases. Systems biology-based approaches study infectious diseases by analyzing the interactions between the host species and the pathogen organisms. This work aims to combine the knowledge from experimental studies of host-pathogen interactions in several diseases to build stronger predictive models. Our approach is based on a formalism from machine learning called 'multitask learning', which considers the problem of building models across tasks that are related to each other. A 'task' in our scenario is the set of host-pathogen protein interactions involved in one disease. To integrate interactions from several tasks (i.e. diseases), our method exploits the similarity in the infection process across the diseases. In particular, we use the biological hypothesis that similar pathogens target the same critical biological processes in the host, in defining a common structure across the tasks. RESULTS: Our current work on host-pathogen protein interaction prediction focuses on human as the host, and four bacterial species as pathogens. The multitask learning technique we develop uses a task-based regularization approach. We find that the resulting optimization problem is a difference of convex (DC) functions. To optimize, we implement a Convex-Concave procedure-based algorithm. We compare our integrative approach to baseline methods that build models on a single host-pathogen protein interaction dataset. Our results show that our approach outperforms the baselines on the training data. We further analyze the protein interaction predictions generated by the models, and find some interesting insights. AVAILABILITY: The predictions and code are available at: http://www.cs.cmu.edu/~mkshirsa/ismb2013_paper320.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Artificial Intelligence , Bacterial Proteins/metabolism , Host-Pathogen Interactions , Protein Interaction Mapping/methods , Algorithms , Humans
19.
Bioinformatics ; 28(18): i466-i472, 2012 Sep 15.
Article in English | MEDLINE | ID: mdl-22962468

ABSTRACT

MOTIVATION: Approaches that use supervised machine learning techniques for protein-protein interaction (PPI) prediction typically use features obtained by integrating several sources of data. Often certain attributes of the data are not available, resulting in missing values. In particular, our host-pathogen PPI datasets have a large fraction of missing values, in the range of 58-85%, which makes it challenging to apply machine learning algorithms. RESULTS: We show that specialized techniques for missing value imputation can improve the performance of the models significantly. We use cross-species information in combination with machine learning techniques like Group lasso with ℓ1/ℓ2 regularization. We demonstrate the benefits of our approach on two PPI prediction problems. In our first example of Salmonella-human PPI prediction, we are able to obtain high prediction accuracies with 77.6% precision and 84% recall. Comparison with various other techniques shows an improvement of 9 in F1 score over the next best technique. We also apply our method to Yersinia-human PPI prediction successfully, demonstrating the generality of our approach. AVAILABILITY: Predicted interactions, datasets, and features are available at: http://www.cs.cmu.edu/~mkshirsa/eccb2012_paper46.html. CONTACT: judithks@cs.cmu.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
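The abstract reports precision and recall but not the F1 value itself. As a small worked example, the standard F1 formula (the harmonic mean of precision and recall) can be applied to the reported 77.6% precision and 84% recall. The function name `f1_score` is only an illustrative helper, and the result is a derived figure, not one stated by the authors.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported for Salmonella-human PPI prediction.
    f1 = f1_score(0.776, 0.84)
    print(f"{f1:.3f}")  # prints 0.807
```

Under this formula the reported figures correspond to an F1 of roughly 80.7 on a 0-100 scale.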


Subjects
Artificial Intelligence , Bacterial Proteins/metabolism , Host-Pathogen Interactions , Protein Interaction Mapping/methods , Algorithms , Bacterial Proteins/chemistry , Gene Expression , Host-Pathogen Interactions/genetics , Humans , Salmonella/metabolism , Sequence Analysis, Protein , Yersinia pestis/metabolism
20.
Int J Comput Biol Drug Des ; 4(1): 83-105, 2011.
Article in English | MEDLINE | ID: mdl-21330695

ABSTRACT

Viruses depend on their hosts at every stage of their life cycles and must therefore communicate with them via Protein-Protein Interactions (PPIs). To investigate the mechanisms of communication by different viruses, we overlay reported pairwise human-virus PPIs on human signalling pathways. Of 671 pathways obtained from NCI and Reactome databases, 355 are potentially targeted by at least one virus. The majority of pathways are linked to more than one virus. We find evidence supporting the hypothesis that viruses often interact with different proteins depending on the targeted pathway. Pathway analysis indicates overrepresentation of some pathways targeted by viruses. The merged network of the most statistically significant pathways shows several centrally located proteins, which are also hub proteins. Generally, hub proteins are targeted more frequently by viruses. Numerous proteins in virus-targeted pathways are known drug targets, suggesting that these might be exploited as potential new approaches to treatments against multiple viruses.
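The overlay step described above, mapping reported human-virus PPIs onto pathway membership, can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the pathway names, protein identifiers, and virus labels are hypothetical placeholders, and real input would come from the NCI and Reactome databases.

```python
from collections import defaultdict


def targeted_pathways(pathways, ppis):
    """Map each pathway to the set of viruses that interact with one of its proteins.

    pathways: dict of pathway name -> set of human protein identifiers
    ppis: iterable of (virus, human_protein) pairs
    Pathways with no viral interaction are omitted from the result.
    """
    viruses_by_pathway = defaultdict(set)
    for virus, protein in ppis:
        for name, members in pathways.items():
            if protein in members:
                viruses_by_pathway[name].add(virus)
    return dict(viruses_by_pathway)


if __name__ == "__main__":
    # Hypothetical toy data for illustration only.
    pathways = {
        "pathway_A": {"P1", "P2"},
        "pathway_B": {"P2", "P3"},
        "pathway_C": {"P4"},
    }
    ppis = [("virus_X", "P1"), ("virus_Y", "P2"), ("virus_X", "P3")]
    hits = targeted_pathways(pathways, ppis)
    print(len(hits), "of", len(pathways), "pathways targeted")
    # Pathways linked to more than one virus.
    print(sorted(p for p, v in hits.items() if len(v) > 1))
```

Counting the keys of the result gives the number of pathways targeted by at least one virus. For the figures in the abstract, 355 of 671 pathways corresponds to about 52.9%.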


Subjects
Host-Pathogen Interactions , Signal Transduction , Systems Biology/methods , Virus Diseases/metabolism , Virus Diseases/virology , HIV-1/physiology , Humans , Protein Interaction Mapping , Virus Physiological Phenomena